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PRECOMPUTING STRATEGY FOR HAMILTONIAN MONTE CARLO 
METHOD BASED ON REGULARITY IN PARAMETER SPACE 

CHENG ZHANG*, BABAK SHAHBABAt, AND HONGKAI ZHAO* 


Abstract. Markov Chain Monte Carlo (MCMC) algorithms play an important role in statistical inference 
problems dealing with intractable probability distributions. Recently, many MCMC algorithms such as Hamiltonian 
Monte Carlo (HMC) and Riemannian Manifold HMC have been proposed to provide distant proposals with high 
acceptance rate. These algorithms, however, tend to be computationally intensive which could limit their usefulness, 
especially for big data problems due to repetitive evaluations of functions and statistical quantities that depend 
on the data. This issue occurs in many statistic computing problems. In this paper, we propose a novel strategy 
that exploits smoothness (regularity) in parameter space to improve computational efficiency of MCMC algorithms. 
When evaluation of functions or statistical quantities are needed at a point in parameter space, interpolation from 
precomputed values or previous computed values is used. More specifically, we focus on Hamiltonian Monte Carlo 
(HMC) algorithms that use geometric information for faster exploration of probability distributions. Our proposed 
method is based on precomputing the required geometric information on a set of grids before running sampling 
algorithm and approximating the geometric information for the current location of the sampler using the precomputed 
information at nearby grids at each iteration of HMC. Sparse grid interpolation method is used for high dimensional 
problems. Tests on computational examples are shown to illustrate the advantages of our method. 
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1. Introduction. Many statistical and machine learning methods rely on costly iterative al¬ 
gorithms for optimization or sampling. One of the most computationally intensive components 
of these methods is repetitive evaluations of functions, their derivatives, geometric and statistical 
quantities that depend on the data. This is especially challenging in Big Data problems. To reduce 
the computation cost, one common approach is subsampling, which restricts the computation to 
a subset of the data or sample, such as stochastic gradient methods (See, for example, [T71 Wj\ ). 
Another approach could be to find some computationally cheaper surrogate functions to substitute 
the expensive objective functions. (See, for example, [24l[25].) In this paper, we propose a different 
approach that explores smoothness or regularity in parameter space, which is true for most statisti¬ 
cal models. When evaluation of functions or statistics quantities are needed at a point in parameter 
space, interpolation from precomputed values or previous computed values is used. Here, we mainly 
focus on a state-of-the-art class of Markov Chain and Monte Carlo (MCMC) sampling algorithms 
called Hamiltonian Monte Carlo (HMC). However, our proposed method could be extended to other 
computationally intensive, iterative algorithms commonly used in statistics and machine learning. 

MCMC was first introduced by Metropolis [50] to simulate the distribution of states for a system 
of idealized molecules. Almost contemporarily, Alder and Wainwright |4] proposed a deterministic 
approach to molecules simulation called molecular dynamics (MD). In the following decades, the 
MCMC and molecular dynamics approaches have continued to develop in their respective areas. In 
1987, Duane, Kennedy, Pendleton and Roweth [13] made a remarkable breakthrough by developing 
a Hybrid Monte Carlo (HMC) algorithm based on combining MCMC and molecular dynamics 
approaches. This is also known as Hamiltonian Monte Carlo (HMC) in the literature. Neal [22] 
provided an extensive review of this method and presented several extensions. The basic idea 
is that starting from the current state of MCMC, one can use MD to generate trial moves (i.e., 
proposals within the Metropolis algorithm) that can move far from the current state (resulting 


^Department of Mathematics, University of California, Irvine, USA 
^Department of Statistics, University of California, Irvine, USA 


1 



in low autocorrelations) while keeping the acceptance probability high (by moving towards high 
probability regions). Therefore, the HMC sampling method can provide more rapid and efficient 
exploration of the parameter space than standard random walk proposals. However, HMC requires 
expensive gradient computations in order to simulate the Hamiltonian dynamics system. This could 
be infeasible in Big Data problems. Therefore, in recent years, there have been many attempts to 
improve computational efficiency of HMC and its variants. (See for example, [571 E dSl HI 113 El 
HHlig.) One possible strategy is to use small subsets of data at each iteration [mill]. As an 
alternative approach, the precomputing strategies we propose here can reduce the computation cost 
of HMC while maintaining the overall efficiency of the method by exploiting smooth dependence of 
parameters that exists in typical statistical models. 

Before we present our method, we first briefly review HMC in the following section. We then 
present our proposed method in Section [3l and evaluate its performance in Section |4l using several 
examples. In Section [5l we discuss an extension of our method that is faster computationally, but 
converges to an approximation of the target distribution. Finally, SectionlUis devoted to discussion 
of future research directions and applications of our method in other algorithms. 

2. Hamiltonian Monte Carlo. In Bayesian Statistics, we are interested in sampling from 
the posterior distribution of the model parameters q given the observed data, Y = (j/i, ?/ 2 , ■ • ■, Vn)'^, 

P{q\Y) !xex.p{-U{q)), (2.1) 

where the potential energy function U is defined as 

N 

U{q) =P{yi\q) -log P{q). (2.2) 

i=l 

Here, the first term is the negative log-likelihood, and P{q) is the assumed prior on model param¬ 
eters. The posterior distribution is almost always intractable. Therefore, Markov Chain Monte 
Carlo (MCMC) algorithms are typically used for sampling from the posterior distribution to per¬ 
form statistical inference. We could for example use the Metropolis algorithm as follows. Given 
the current state, q, we propose a new state, q* using a symmetric proposal distribution such 
that P{q*\q) = P{q\q*). We then accept the proposed state as our new state with the following 
probability: 

min(l,exp[17((?) -U{q*)]) 

The standard random walk Metropolis generates proposals by sampling from a normal distri¬ 
bution with its mean set to the current state, q. There are more efficient strategies to generate 
proposals. Among these, Hamiltonian Monte Carlo (HMC) has become increasingly popular due to 
its capability of making distant proposals (i.e., low autocorrelation) with high acceptance probabil¬ 
ity. More specifically, HMC introduces a Hamiltonian dynamics system with auxiliary momentum 
variables p to propose samples of g in a Metropolis framework that explores the parameter space 
more efficiently compared to standard random walk proposals. Following the dynamics of the intro¬ 
duced Hamiltonian system, HMC generates proposals jointly for {q,p) using the following system 
of differential equations: 


dqi dH 
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Algorithm 1: Hamiltonian Monte Carlo 

Input: Starting position and step size e 
for t = 1, 2, • • • do 

Resample momentum p 
pW ^Ar(0,M), (< 70 ,Po) = 

Simulate discretization of Hamiltonian dynamics: 
for I = \ to L do 

pi-i ^pi-i - 

qi ^ qi-i + eM-^pi_i 
\_Pi^Pi- 

{q*,p*) = (qL,PL) 

Metropolis-Hasting correction: 
u ~ Uniform [0,1] 
p = exp[ir(<7(‘),pW)-iI(<7*,p*)] 

_ if M < min(l,p), then = q* 


where the Hamiltonian function is defined as H{q,p) = 17(<7) + ^p^M~^p. The quadratic kinetic 
energy function K{p) = ^p^Ad~^p corresponds to the negative log-density of a zero-mean multi¬ 
variate Gaussian distribution with the covariance M. Here, M is known as the mass matrix, which 
is often set to the identity matrix, /, but can be used to precondition the sampler using Fisher 
information m- By simulating the Hamiltonian dynamics system together with the correction 
(i.e., accept/reject) step, HMC generates samples from a joint distribution of {q,p) defined by 

P{q,p) oc exp (^-U{q) - ^p^M~^p 

Notice that q and p are independent in general. 

Each sample from the HMC algorithm is generated by two steps: the proposal step and the 
correction step. In the proposal step, new values for the momentum variable p are drawn from their 
Gaussian distribution. Starting from the current state {q,p)-, the Hamiltonian dynamics system 
(1^.(1^ is simulated for L steps using the leapfrog method (Algorithm [T]), with a stepsize of e. 
Here, L and e are parameters which needs to be tuned to obtain a reasonable acceptance probability. 
In the correction step, the proposed state {q*,p*) at the end of the trajectory is accepted as the 
next state of the Markov chain with probability min(l,exp[—il(( 7 *,p*) -|- H{q,p)]) and the position 
variable q is updated correspondingly. These steps are presented in Algorithm [T] 

Note that in Algorithm [1] when simulating the Hamiltonian dynamics system, we need to 
repeatedly compute the gradient of the potential energy function U. This could be extremely time 
consuming. Many attempts have been made in recent years to reduce this cost in order to improve 
the overall computational efficiency of HMC. See for example, Shahbaba et al. [25] and Chen [10] . 
In this paper, we claim that the smooth dependence on parameters can be exploited to approximate 
the gradient effectively using precomputed or previous-computed values on a grid of points. We 
will discuss our method in details in the following section. 


3. Precomputing Strategies. 
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Fig. 3.1. 2D Gaussian example: (a) the graph of the first component ofV^U. The function value at red point 
can be obtained by interpolation when the function values at three blue points are known, (b) HMC samples from 


the posterior distribution. 

3.1. Insights from an Illustrative Example. We start with a simple example to motivate 
our approach. Consider a bivariate Gaussian distribution with known covariance matrix and a 
conjugate prior 


Y\^i ^ - A/'(/xo,i;o) 


Note that in this case, the posterior distribution has a closed form so MCMC is not required. 
However, we use this example to motivate our method. For this problem, the potential energy 
function and its gradient are given by 



(3.1) 


(3.2) 


In the gradient function (13.21) . all the information about the parameter is contained in one single 


value Y (i.e., the sufficient statistic for /i). Therefore, if we precompute Y, gradient computation 


of the potential energy function U could be reduced to a simple matrix vector multiplication. 


Moreover, the gradient function itself is a linear function. In this 2D case, the essential information 
of ^ can be captured by its function values at three non-collinear points (left panel of Fig lTTI) . On 


the other hand, samples from the posterior distribution are concentrated around the high density 
region where the neighborhood of one sample is frequently visited in the simulations of Hamiltonian 
dynamics (right panel of Fig l3.ll) . We use these insights to develop a method that can approximates 
the gradient function using precomputed values in order to accelerate standard HMC. 

3.2. Force Approximation. If we could solve Hamilton’s equations (I2.3l and l2.4l) analytically, 
the acceptance probability of new proposals in HMC would be exactly one (i.e., each proposal 
is accepted) because of the conservation of the Hamiltonian [22]. However, since solving these 
equations exactly is too hard in practice, we usually approximate them by discretizing time and 
using the leapfrog method (Algorithm [T]). As a result, the acceptance probability may be less than 
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one. The tradeoff between the accuracy of the proposal-generating mechanisms and Metropolis 
acceptance probability can go beyond time discretization. Therefore, in this paper we ask the 
following question: can we properly approximate the proposal-generating mechanism in order to 
reduce computational complexity while keeping the acceptance probability at a reasonable level? We 
answer this question in the remaining part of this section. 

Note that we can rewrite Hamilton’s equations as follows: 


dqi 

dt 

dpi 

dt 


[M-V]. 

dqi 


(3.3) 

(3.4) 


The routine of the trajectory for one proposal step will be determined by both the random initial¬ 
ization of the momentum and the negative gradient of the potential energy function, which is called 
force in Physics, 


F = - 


dq 


The random momentum enables the scheme to explore the target distribution stochastically, and 
the fictitious force guides the sampler in the right direction so that the entire sampling method 
would be more efficient than random walk proposals. However, the computation of the true force F 
is quite expensive. To alleviate this issue, we propose to construct a Hamiltonian dynamics system, 
at this time for the proposal step only, using an alternative Hamiltonian function. 


H{q,p) = U{q)-\- K{p) 


where U is an approximation to the potential energy U, whose corresponding negative gradient F 
(which is an approximation to the true force function F) can be computed relatively fast. Since the 
simulation of a Hamiltonian dynamics system only involves the force function, it suffices to find an 
approximate force function, F, directly. 

3.3. Naive Grid HMC. To approximate the force function, one could simply use a piecewise 
constant functions, which corresponds to a piecewise linear approximation of the potential energy. 
In most cases, the high density region of the posterior distribution can be covered by a finite domain 
D, henceforth called “domain of interest”. If we partition D with a fine grid, justification of an 
appropriate piecewise constant approximation to the force function F is guaranteed by the smooth 
dependence of F (or [/) in parameter space. For a 2-dimensional problem, suppose our domain of 
interest is D = [a, b] x [c, d]. Given the grid points 


Xi = a + iAx, j/j=c-|-jAy, i, j = 0,1,..., iVp 

where Ax, At/ are the corresponding grid sizes, for each cell, Cij = [xi-i,Xi] x [t/j_i,t/j], we 
approximate the force function by its value at the center of the grid, Cjj- = (xj_i/ 2 ) I/j-i/ 2 )' 

F{q) = = Fic,j), if q G Cq, 

By the smoothness of F, IjF’ — F||oo —>■ 0 as Ax, Ay 0. Therefore, we can always find some fine 
grid to achieve the desired approximation accuracy. 
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Fig. 3.2. Force map of a logistic regression model 


Figure lX^ shows a piecewise constant approximation to the force function of a logistic regression 
model with design matrix X = (1, Xi) and true parameter /3 = (—1,1)^, where Xi follows standard 
normal distribution. The binary responses Y = {yi,y 2 , ■ ■ ■ tVn)'^ are sampled independently from 
Bernoulli distributions 




Bernoulli(pi), 


exp{x^|3) 

1 + exp(a;j/3) 


Therefore, the likelihood function is 


N 

mx,Y)txY[pf{i-p,y-y^ 

and the potential energy function and the force function are 

(3.5) 

(3.6) 


N 


U{I3) = [yiX^l3 - log(l + exp{xil3))] 


2=1 


FiP) = -^=X^iY-P) 
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Algorithm 2: Naive Grid HMC 


Input: Starting position and step size e 
Precompute the approximate force map F : Fij 



id 


for t = 1, 2, • • • do 

Resample momentum p 

pW ^Ar(0,M), (<70,Po) = 

Simulate discretization of Hamiltonian dynamics: 
Find the position of <70 in the force map: (io, jo) 
for 1 = 1 to L do 
Pi-i ^ Pi-i + 
qi d- qi_i + 

Find the position of qi in the force map: (ii,ji) 

\_ Pi ^ Pi + 

{q*,P*) = iqL,PL) 

Metropolis-Hasting correction: 
u ~ Uniform [0,1] 

p = ,p'‘) 

_ if M < min(l,p), then (7d+i) = q* 


where P = {pi,P 2 , ■ ■ ■ tPnY'■ It can be seen from the graph that: (i) the approximate force “map” 
does point to the right direction so that it provides valid geometric information for HMC; (ii) the 
approximate force function also changes smoothly, which means that the numerical stability of the 
leap-frog scheme can be maintained with approximately the same step size as standard HMC. As a 
result, the proposed scheme with piecewise constant force functions would be consistent and stable. 
Therefore, we can precompute the piecewise constant function F in advance. When evaluating the 
force function in the simulation of the Hamiltonian dynamics system, we locate the cell {i,j) for 
the current parameter <7 and read the approximate function value F{q) from the precomputed force 
map. We summarize this approach in Algorithm [2] and refer to it as Grid HMC (GHMC). 

Our initial results showed that the Naive Grid HMC method would work well for simple prob¬ 
lems. However, implementation of this method in general involves two challenges. First, its ex¬ 
tension to high dimensional problems could be problematic because as the number of parameters 
increases, the number of grid nodes at which we need to evaluate the approximate force map grows 
exponentially. In other words, the method will encounter the curse of dimensionality. The sec¬ 
ond challenge is related to finding the domain of interest. We will address those two issues in the 
following two subsections respectively. 

3.4. Sparse Grid HMC. The sparse grid interpolation method use a special discretization 
technique to approximate a smooth function over a sparse grid of points (ElEllE]). More specifi¬ 
cally, it uses a hierarchical basis (a representation of a discrete function space that is equivalent to 
the conventional nodal basis) and a sparse tensor product construction. Discretization on sparse 
grids employs 0(A • log(A)‘^“^) grid points only, where d denotes the dimension and N denotes the 
number of grid points at the boundary in each coordinate direction (i.e., the mesh size is h = 1/N). 
Using piecewise linear basis functions, the interpolation accuracy could be ■ (log with 
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respect to the L 2 norm and Loo norm if the solution has bounded second mixed derivatives. Note 
that for a full grid, grid points are needed to achieve an approximation accuracy of 0{N~^). 

3.4.1. Smolyak’s formula. Assume that we want to approximate the smooth functions / : 
[0,1]'^ —>■ K using a finite number of function values (at support nodes). For the one dimensional 
case, the interpolation formula is given by 


rrii 

where i e N, X® = {x® € [0, l]|j = 1,... ,mi} are the support nodes, and a® € C'([0,1]) are the 
basis functions. We could use the following tensor product for multidimensional cases: 

rrii^ rrii^ 

(C/®®®-.-®C/®^)= ^ (3.7) 

ii=i jd=i 

However, the above product formula requires a large number (m^j • ■ • of support nodes, which 
are sampled on the full grid. Smolyak’s formula then can be applied here to reduce the number 
of support nodes while maintaining the approximation quality of the interpolation formula up to a 
logarithmic factor. With 17° = 0, define 

A® = C/®-C/®"\ VfeN 

Moreover, we put |i| = H + • • • + for i € N'’*. Then Smolyak’s algorithm is given by 

AUf) = E ® ® = A-lM) + E ® ® (3.8) 

\i\<q |i|=ij 

'-' 

^A^Af) 

for integers q > d, where Ad-i,d = 0. In fact, (j3.8l) can be presented in terms of the univariate 
interpolation formulas [26], 

AqM)= E 

g—cZ+l<|2|<(y ^ 
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Algorithm 3: Sparse Grid HMC 
Input: Starting position and step size e 

Precompute the hierarchical surpluses for Smolyak’s formula Ak+d,d of potential energy U 
for t = 1, 2, • • • do 

Resample momentum p 
-7V(0,M) 
iqo,po) = 

Simulate discretization of Hamiltonian dynamics 
for 1 = 1 to L do 

Pl-l t— Pl-l — Ak+d,d{U){qi-i) 
qi qi-i + 

\_pi^pi- Ak+dAU){qi) 

{q*,P*) = 

Metropolis-Hasting correction: 
u ~ Uniform [0,1] 

_ if M < min(l,p), then = q* 


Therefore, only the function values at the sparse grid 

Hq^d = U X • • • X XA (3.9) 

g—c/+l<|i|<g 

are needed to evaluate Aq^dif) ■ It is better to select the sets A* in a nested fashion (A* C A*+^) 
to obtain many recurring points with increasing q. 

3.4.2. Sparse grid and Multivariate hierarchical structure. There are many possibilities 
to construct nested sparse grids. As an example, Fig. 13.31 shows the Clenshaw-Curtis type sparse 
grids H^^ in two and three dimensional spaces. With appropriate sparse grid and basis functions 
a, the multivariate interpolation formula (13.81) can be implemented in a hierarchical form where 






The hierarchical surpluses 


>«;i) 

(3.10) 


wY = /(®j) - Ak+d-iAA) 

introduced by Bungartz [7] can be used to obtain an estimate of the current approximation error 
and terminate the algorithm automatically when a desired accuracy is reached. More detailed 
information about the construction of sparse grid and basis functions and the derivation of the 
hierarchical form is provided in the appendix. 

Using sparse grid interpolation (13.81) based on the Smolyak algorithm, we can generalize our 
Naive Grid HMG method to relatively higher dimensional problems. The hierarchical surpluses for 
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Fig. 3.4. Domains of interest using Laplace’s approximation for the logistic regression model. 


the energy function U can be precomputed with certain type of sparse grid and VAq^d can be called 
to replace the gradient computation (see Algorithm [3|) . 

3.5. Domain of Interest. The grid needs to be specified over a finite domain of interest such 
that there is a good balance between the cost and efficiency of the precomputing strategy. That is, 
we need to find an appropriate bounded domain that covers most of the high density region without 
creating cells that are rarely visited by the sampler. Note that for points outside of this domain of 
interest, one can still use the standard HMC method; however, there will not be any computational 
saving for these points. 

To find the domain of interest, we use Laplace’s approximation, 

q\Y^M{q,J-\q)) 

where q is the posterior mode which can be estimated using fast optimization methods, and J^{q) = 
Hjjiq) is the Hessian matrix at the point. Given a pre-specified probability, p, we can hnd a domain 
with probability p based on the above normal approximation. 

Figure 1331 shows the domain of interest R for a logistic regression model. It can be seen that 
for different data sizes {N = 100 and 1000), the corresponding domains of interest are adjusted 
automatically to capture the high density regions of the posterior distribution. 

When the high density region is irregular and can not be represented well by a rectangular box, 
this might not be an efficient approach. Later, we will discuss a more general approach for such 
cases. 

4. Experiments. In this section, we compare our proposed method to standard HMC us¬ 
ing several experiments in terms of sampling efficiency. We define sampling efficiency as time- 
normalized effective sample size (ESS). Given B MCMC samples for each parameter, we calculate 
the corresponding ESS = B[1 + 2E^^7(fc)]“^, where E^^ 7 (fc) is the sum of K monotone sample 
autocorrelations M- We use the minimum ESS over all parameters normalized by the CPU time, 
s (in seconds), as the overall measure of efficiency: min(ESS)/s. 

Empirical results show that both GHMC and Sparse Grid HMC (SGHMC) provide substantial 
improvement over standard HMC in terms of efficiency while maintaining relatively high acceptance 
rates. 
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Fig. 4.1. HMC vs GHMC: logistic regression 


4.1. Logistic regression. For our first example, we sample N = 100 data points from a 
logistic regression model discussed in Section [3] and choose the domain of interest to be [—3,0.5] x 
[—0.5, 3] and set the grid size to 0.1. Figure l4d] shows posterior samples using standard HMC and 
GHMC. Note that they both converge to the target distribution and explore the parameter space 
quite well. Table HTT] compares the performance of these algorithms based on 3200 MCMC iterations 
after burning the first 800 iterations. As we can see, GHMC outperforms standard HMC in terms 
of time-normalized ESS. 

Table 4.1 

Comparing HMC with CHMC using a logistic regression model. For each method, we provide the acceptance 
rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS(/3o, /3i) 

s/Iteration 

min ESS/s 

HMC 

GHMC 

0.9225 

0.7981 

(3200,3200) 

(3200,3200) 

7.0157F;-4 

3.318F;-4 

1425.3707 

3013.9031 


4.2. Banana-shaped distribution. The potential energy function for the logistic regres¬ 
sion model is quite similar to a Gaussian distribution model, where the resulting force function 
is relatively smooth. To investigate GHMC’s ability to explore the parameter space with a more 
complicated geometry, we construct a banana-shaped posterior distribution of ,5 = (/3i,/?2|y) based 
on the following model: 

j/l/3 - A/'(/3i + I3l,al) 

The data are generated with = 1, <^y = 2, (T /3 = 1. The potential energy function is 


N 


U[I3) = ^ - Pi - /^l)^ Pi + P2 


2al 


2a" 


and the force function is 


du E*=i(yi-/3i-/3|) 


(4.1) 


2/32 


(4.2) 
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HMC 


GHMC 


Fig. 4.2. HMC vs GHMC: banana-shaped distribution 


Here, we choose the domain of interest to be [—4,4] x [—4,4] and set grid size to 0.1. Figure 
14.21 shows the samples for the posterior distribution using standard HMC and GHMC. As before, 
both methods converge to the target distribution and explore the parameter space quite well. Even 
though Banana-shaped distribution is more distorted and the force function is more complex, F 
(grid size 0.1) still provide a good approximation to the true force function. Table [421 compares 
the performance of these algorithms based on 3200 MCMC iterations after burning the first 800 
iterations. As before, GHMC outperforms standard HMC in terms of time-normalized ESS. 

Table 4.2 

Comparing HMC with GHMC using a banana-shaped distribution model. For each method, we provide the 
acceptance rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS(/3i, /32) 

s/lteration 

min ESS/s 

HMC 

GHMC 

0.9353 

0.6587 

(2403,1191.6) 

(893.8862,766.2423) 

3.8703F;-4 

1.4498F;-4 

962.1346 

1651.5917 


4.3. Gaussian Process model. For our third example, we use a Gaussian process model. 
Posterior sampling for these models tends to be quite difficult due to the computation cost associated 
with inverting the covariance matrix. See Neal [21] and Rasmussen [23j for more details on Gaussian 
process. Here we construct a 2D Gaussian process with zero mean and the squared exponential 
covariance function. 


Y ~ A/'(0,I]), Sy = ?7 • exp (-ZJJxi - Xj\\l) + J ■ 5ij 
where rj, I, J are positive hyperparameters with log-normal priors. 

log(7?) ~ A/'(-l, 1), log(0 - M{-1, 1), log( J) ~ 1) 

Let R = log(? 7 ), I = log(?), J = log(J), the potential energy function is 


U{fj, I J) ■- 


: i log(isi) + + l[{fi + 1)" + {i + 1)" + (J +1)^ 
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HMC 


SGHMC 


Fig. 4.3. HMC vs SGHMC: Gaussian Process 


and the force function is 


m = 


dU 






Y + P + 1, 




The domain of interest for /3 = {fj, I, J)^ is set to be [—1.6,1.6] x [—1.6,1.6] x [—1.2,0.4] where we 
train a sparse gird interpolator to replace the force function. Figure IT751 shows the samples from the 
posterior distribution given by standard HMC and SGHMC. Table. lT3l compares the performance 
of the two algorithms based on 3200 MCMC iterations after 800 burn-in iterations. As we can see, 
SGHMC substantially outperform standard HMC. 


Table 4.3 

Comparing HMC with SGHMC using a Gaussian process model. For each method, we provide the acceptance 
rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS(r/, 1, J) 

s/Iteration 

min ESS/s 

HMC 

SGHMC 

0.9472 

0.7066 

(1021.7,1784.8,3200) 

(828.7,1380.0,3200) 

2.3547E-1 

2.9851E-2 

1.3559 

8.6752 


4.4. Elliptic PDE Inverse Problem. Our last example is a canonical inverse problem in¬ 
volving inference of the diffusion coefficient in an elliptic PDE ([lain]). The forward model is to 
solve a two dimensional elliptic PDE 

V^-{c{x,e)V^u{x,9))=Q (4.3) 

where x = {xi,X 2 ) G [0,1]^ is the spatial coordinate. The boundary conditions are 

u{x,e%^^Q = xi, u{x,e%^^i = l-xi 


du{x, 9) 
dxi 




= 0 , 


du{x, 9) 
dxi 


Xi — 1 


= 0 


This PDE provides a simple model of steady-state flow in porous media. The coefficient c represents 
the permeability of a porous medium while u represents the pressure head. In this inverse problem. 
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HMC SGHMC 

Fig. 4.4. HMC vs SGHMC: an elliptic PDE inverse problem 


the objective of interest is to infer the unknown diffusion coefficient conditioned on observation 
data where Bayesian approach can be naturally adopted. A log-Gaussian process prior is given to 
the diffusivity field c{x) with mean zero and an isotropic squared-exponential covariance kernel: 

^ 2 f \\xi-X2 

C{xi,X 2 ) = a exp I- 

for which we choose variance cr^ = 1 and a length scale I = 0.2. With this prior, the field can be 
easily parameterized with a Karhunen-Loeve (K-L) expansion: 

/ d 

c{x,6) Ki exp y^ 9iC\iVi{x) 

Vi=l 

where Xi and Vi{x) are the eigenvalues and eigenfunctions of the integral operator defined by the 
kernel (7, and the parameter 9i are endowed with independent standard normal priors, 9i ~ 1), 

which are the targets of inference. To reduce the dimension of this inference problem, the Karhunen- 
Loeve expansion is truncated at the first five modes (d = 5) and the corresponding mode weights 
{9i,... ,9^) are conditioned on data. Data are generated by combining observations of the solution 
field (solve the PDE on a finer 51-by-51 grid) on a uniform 11 x 11 grid covering the unit square 
with additive independent Gaussian noise. 

yj = u{xj,9) + Cj, Cj Af{0, 0.1^) 

Gonsistent with results from previous examples, SGHMG performs substantially better than HMG 
(Table HU). 

5. Approximate Target Distribution. So far, our proposed method has been based on 
using the exact target distribution and approximating the proposal generating mechanism only. We 
can improve the computation speed even more by using grid approximation for U in the correction 
step (accept/reject step) as well. In this case, the resulting sampler actually samples from an 
approximate distribution 



Q{q) oc exip{-U{q)) 











































Precomputing strategy for HMC 


15 


Table 4.4 

Comparing HMC with SGHMC using an elliptic PDE inverse problem. For each method, we provide the 
acceptance rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS 

s/Iteration 

min ESS/s 

HMC 

SGHMC 

0.7719 

0.6141 

(991.8,2091.2,2831.0) 

(855.7,1325.7,1937.5) 

2.02F;-1 

6.1952A-2 

1.5343 

4.3165 


instead of the target posterior distribution P{q). The bound of the difference between these two 
distributions measured by the Kullback-Leibler divergence is shown in the following theorem. 

Theorem 5.1. If U and V are energy functions corresponding to probability distributions P 
and Q 


P{q) oc exp(-17(g)), Q{q) oc exp(-y(g)) 
then the Kullback-Leibler divergence between P and Q is bounded by 

DKLiP\\Q)<2\\U-V\\oo 


Proof. 

D,UPllQ)=lPi,nn[^^) d, 

= [ P{q){V{q)-U{q))dq+ [ P{q)\n(^^) dq 

where 

Ip= exp{-U{q)) dq, Iq= exp{-V(q)) dq 

JR-D Jr-d 

since 

Iq=[ exp{-V (q)) dq = f exp{-U{q)) ■ exp{-{V{q) - U{q))) dq 

JK-D J^D 

< exp(||l/ - 17||oo) ■ f exp(-[/(g)) dq = exp(||y - 17||oo) • Ip 

JRn 

we have 

DKLiPWQ) < I|b" - U\\oo ■ f P{q) dq + ||y - 17||oo • [ P(q) dq = 2||F - U\\^ 

Js.n J^D 

□ 

Note that if the potential energy function [/ is a smooth function, \\tj — U\\oo ^ 0 as the 
grid size goes to 0. By Theorem. 15.11 the resulting sampler will eventually converge to the target 
sampler. 

We apply this method, called GHMC-complete, to the logistic regression and banana-shaped 
distribution examples discussed above. The results are shown in Figures l5.ll and l5.21 As we can see. 
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HMC GHMC GHMC-complete 


Fig. 5.1. HMC vs GHMC: logistic regression 



HMC 


GHMC GHMC-complete 

Fig. 5.2. HMC vs GHMC: hanan-shaped distribution 


the posterior samples given by GHMC-complete in both cases match the exact samplers (HMC and 
GHMC) quite well. With appropriate grid size (around the step size), GHMC-complete can provide 
a high quality approximation to the standard HMC sampler. At the same time, computational 
efficiency has been substantially improved due to the fast computation of potential energy function 
in the correction step 1 Tables IffiT] and lOl) . 

For the logistic regression example. Figure [531 shows the prediction accuracy vs. the run time 
for the three algorithms based on a test set. As we can see, the prediction accuracy (measured in 
terms of the average log-likelihood on the test data) of GHMC-complete increases faster compared 
to the other two methods. For computationally intensive models, the advantage of GHMC-complete 
will be more significant since the computation cost of the potential energy function becomes more 
expensive. 

6. Discussion. Due to its ability of producing distant proposals with high acceptance proba¬ 
bility, HMC can provide rapid exploration of the parameter space when sampling from the posterior 
distribution. However, the gradient computation to obtain essential geometric information prevents 
its application on computationally intensive problems when the data size is large. To address this 
issue, we have proposed a relaxed framework, where HMC can take advantage of the smoothness of 
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Fig. 5.3. The average log-likelihood on test data for a logistic regression model 


Table 5.1 

Comparing HMC with GHMC using a logistic regression model. For each method, we provide the acceptance 
rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS(/lo, /?i) 

s/Iteration 

min ESS/s 

HMC 

0.9225 

(3200,3200) 

7.0157F;-4 

1425.3707 

GHMC 

0.7981 

(3200,3200) 

3.318F-4 

3013.9031 

GHMC-complete 

0.7931 

(3191.8,3200) 

2.9237F;-4 

3411.5275 


Table 5.2 

Comparing HMC with GHMC using a banana-shaped distribution model. For each method, we provide the 
acceptance rate (AR), the CPU time (s) for each iteration and the time-normalized ESS 


Method 

AR 

ESS(/li, (32) 

s/Iteration 

min ESS/s 

HMC 

0.9353 

(2403,1191.6) 

3.8703A-4 

962.1346 

GHMC 

0.6587 

(893.8862,766.2423) 

1.4498F-4 

1651.5917 

GHMC-complete 

0.6697 

(980.1443,796.2977) 

1.2279F-4 

2026.6108 


the potential energy function U in parameter space to accelerate computation by using grid-based 
precomputing strategies. The key idea is to approximate the force field generated by the potential 
energy function U through interpolation of those precomputed field at grid points in each HMC it¬ 
eration. Based on these ideas, two simple grid based algorithms, Naive Grid HMC and Sparse Grid 
HMC, are proposed and evaluated on several problems. Empirical results show that our approach 
can capture the main information needed for HMC’s implementation at a lower computational cost. 
As a result, our method tends to be more effective than standard HMC. 

While quite effective in relatively low dimensional problems, extension of grid-based HMC to 
high dimensional problems could be quite challenging. Future research direction could involve 
finding effective strategies to alleviate this issue. 

Another direction is to find more efficient method to locate the domain of interest. In subsection 
13.51 we used Laplace’s approximation for this purpose. As shown in Figure [5^ this strategy might 
not be effective when the resulting Gaussian distribution is not a good approximation for the target 
distributions. An alternative and more general approach is based on following the trajectories of 
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(a) logistic regression 


(b) banana shaped distribution 


Fig. 5.5. Domains of Interest via early trajectories. 


the burn-in samples. Even though MCMC samplers might not converge to the target distribution 
in the early stage, those trajectories can capture the high density region to some extent. Figure [531 
shows the cells visited by those early trajectories for the logistic regression example and the banana 
shaped distribution example. 

The proposed precomputing strategy is not limited to HMC only. In fact, it can be integrated 
with other MCMC methods involving expensive computation of redundant information. For exam¬ 
ple, Fisher information matrices can be precomputed at each cell center to accelerate Riemannian 
Manifold HMC [15]. 
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Fig. A.l. Piecewise linear Nodal basis (a) and hierarchical functions (b) with support nodes Xj G * = 1) 2, 3 
for the Clenshaw-Curtis grid 


Appendix. More On Sparse Grid. 

A.l. Construction of sparse grid and basis functions. The Clenshaw-Curtis type sparse 
grid introduced in subsection 13.4.21 is constructed from the following formulas. Here, the cc* 

are defined as 


f (j - l)/(wi - 1), j = 

\ 0.5, j = l,mi = 1 


In order to obtain nested sets of points, the number of nodes is given by 

mi = 1 and mi = 2*“^ + 1 for i > 1 

Piecewise linear basis functions a can be used for the univariate interpolation formulas C/*(/). 


z}(a;) = 1, 


a){x) = 


1 - (m* - 1) • |a; - x’j\ 

0 , 


\x - a;* I < ^/{m^ - 1), 
otherwise 


for i > 1 and j = 1,..., . 


A.2. Derivation of the hierarchical form. With the selection of nested sets of points, we 
can easily transform the univariate nodal basis into the hierarchical one. By definition, we have 

A\f) = U\f)-W-\f) 

rrii rrii 

i=i 1=1 

rrii 

= J2{f{x))-U^-\f){x)))-a^ 

1=1 
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0 0.25 0,5 0,75 1 


(a) Nodal basis 


(b) Hierarchical basis 


Fig. a.2. Piecewise linear interpolation: Nodal versus Hierarchical 


since f{x^j) - W ^{f)(xj) = 0, V x® e X® \ 

AX/)= E (A.l) 

From (lA.ip we note that for all A®(/), only the basis functions belonging to the grid points that have 
not yet occurred in a previous set 1 < fc < i —1 are involved. Fig lA.ll gives a comparison of the 

nodal and the hierarchical basis functions and Fig lA.2l shows the construction of the interpolation 
formula using nodal basis functions and function values versus using hierarchical basis functions 
and hierarchical surpluses for a univariate function / . Both figures are reproductions based on 
Klimke and Wohlmuth |18) . 

Applying the tensor product formula (13.71) with A® given in (lA.ll) . the hierarchical update in 
the Smolyak algorithm (13.81) now can be rewritten as 

AA,,rf(/)= E(A*^®-'-®A®^)(/) 

1 * 1=9 

= E E ••• E 

l»l=9x*iex^ 


3d' 
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